- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- North America > Canada (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
NL-Eye: Abductive NLI for Images
Ventura, Mor, Toker, Michael, Calderon, Nitay, Gekhman, Zorik, Bitton, Yonatan, Reichart, Roi
Will a Visual Language Model (VLM)-based bot warn us about slipping if it detects a wet floor? Recent VLMs have demonstrated impressive capabilities, yet their ability to infer outcomes and causes remains underexplored. To address this, we introduce NL-Eye, a benchmark designed to assess VLMs' visual abductive reasoning skills. NL-Eye adapts the abductive Natural Language Inference (NLI) task to the visual domain, requiring models to evaluate the plausibility of hypothesis images based on a premise image and to explain their decisions. NL-Eye consists of 350 carefully curated triplet examples (1,050 images) spanning diverse reasoning categories: physical, functional, logical, emotional, cultural, and social. The data curation process involved two steps: writing textual descriptions and generating images with text-to-image models, both of which required substantial human involvement to ensure high-quality and challenging scenes. Our experiments show that VLMs struggle significantly on NL-Eye, often performing at random-baseline levels, while humans excel at both plausibility prediction and explanation quality. This demonstrates a deficiency in the abductive reasoning capabilities of modern VLMs. NL-Eye represents a crucial step toward developing VLMs capable of robust multimodal reasoning for real-world applications, including accident-prevention bots and generated-video verification.
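As a rough illustration of the evaluation protocol the abstract describes, the sketch below scores a VLM on premise-hypothesis triplets. The `Triplet` field names and the `query_vlm` callable are hypothetical stand-ins, not the benchmark's actual schema or API.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    premise: str          # path to the premise image
    hypothesis_a: str     # path to hypothesis image A
    hypothesis_b: str     # path to hypothesis image B
    gold: str             # "A" or "B": the more plausible hypothesis

PROMPT = (
    "Given the premise image, which hypothesis image (A or B) depicts the "
    "more plausible outcome or cause? Answer with the letter, then explain."
)

def evaluate(triplets, query_vlm):
    """query_vlm: any VLM interface taking images plus a prompt, returning text."""
    correct, explanations = 0, []
    for t in triplets:
        answer = query_vlm(images=[t.premise, t.hypothesis_a, t.hypothesis_b],
                           prompt=PROMPT)
        pred = "A" if answer.strip().upper().startswith("A") else "B"
        correct += int(pred == t.gold)
        explanations.append(answer)  # explanation quality is judged separately
    return correct / len(triplets), explanations
```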
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Workflow (1.00)
- Research Report > New Finding (0.67)
Disrupting Diffusion-based Inpainters with Semantic Digression
Son, Geonho, Lee, Juhun, Woo, Simon S.
The fabrication of visual misinformation on the web and social media has increased exponentially with the advent of foundational text-to-image diffusion models. Namely, Stable Diffusion inpainters allow the synthesis of maliciously inpainted images of personal and private figures and copyrighted content, also known as deepfakes. To combat such generations, a disruption framework, Photoguard, has been proposed, which adds adversarial noise to the context image to disrupt inpainting synthesis. While that framework takes a diffusion-friendly approach, its disruption is not sufficiently strong, and immunizing the context image requires a significant amount of GPU memory and time. In our work, we re-examine both the minimal and favorable conditions for a successful inpainting disruption and propose DDD, a "Digression guided Diffusion Disruption" framework. First, we identify the diffusion timestep range that is most adversarially vulnerable with respect to the hidden space. Within this scope of the noised manifold, we pose the problem as a semantic digression optimization: we maximize the distance between the inpainting instance's hidden states and a semantic-aware hidden-state centroid, calibrated both by Monte Carlo sampling of hidden states and by a discretely projected optimization in the token space. Effectively, our approach achieves stronger disruption and a higher success rate than Photoguard while lowering the GPU memory requirement and speeding up optimization by up to three times.
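The core optimization can be pictured as a PGD-style loop over the vulnerable timestep range. The sketch below is our reading of the abstract, not the authors' implementation: `encode`, `unet_hidden`, and the precomputed `centroid` (e.g., a Monte Carlo average of hidden states) are assumed stand-ins for the diffusion model's internals.

```python
import torch

def disrupt(context_img, encode, unet_hidden, centroid, t_range,
            eps=8/255, step=1/255, iters=100):
    """Immunize a context image by maximizing its hidden-state digression."""
    delta = torch.zeros_like(context_img, requires_grad=True)
    for _ in range(iters):
        t = torch.randint(t_range[0], t_range[1], (1,))  # vulnerable timesteps
        latents = encode(context_img + delta)
        h = unet_hidden(latents, t)                      # hidden states at step t
        loss = -torch.norm(h - centroid)                 # maximize the digression
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()            # signed gradient step
            delta.clamp_(-eps, eps)                      # keep noise imperceptible
            delta.grad.zero_()
    return (context_img + delta).clamp(0, 1)
```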
DeepCompass: AI-driven Location-Orientation Synchronization for Navigating Platforms
Lee, Jihun, Choi, SP, Kang, Bumsoo, Seok, Hyekyoung, Ahn, Hyoungseok, Jung, Sanghee
In current navigation platforms, the user's orientation is typically estimated from the difference between two consecutive locations. In other words, the orientation cannot be identified until the second location is taken. This asynchronous location-orientation identification often leads to a familiar real-life question: why does my navigator point my car in the wrong direction at the beginning of a trip? We propose DeepCompass to identify the user's orientation by bridging the gap between street-view and user-view images. First, we explore suitable model architectures and design the corresponding input configurations. Second, we apply artificial transformation techniques (e.g., style transfer and road segmentation) to minimize the disparity between the street view and the user's real-time view. We evaluate DeepCompass extensively under various driving conditions. DeepCompass requires no additional hardware and, in contrast to magnetometer-based navigators, is not susceptible to external interference. This highlights its potential as an add-on to existing sensor-based orientation detection methods.
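One simple way to realize the street-view/user-view matching idea is to sweep candidate headings and pick the one whose street-view crop best matches the user's current view. The sketch below assumes a generic image encoder `embed` and a panorama cropper `street_view_at`, both hypothetical; any preprocessing such as style transfer or road segmentation would be applied before embedding.

```python
import torch.nn.functional as F

def estimate_heading(user_view, street_view_at, embed, step_deg=30):
    """Return the candidate heading (degrees) whose street-view crop
    is most similar to the user's view under the shared encoder."""
    user_vec = F.normalize(embed(user_view), dim=-1)
    best_heading, best_score = None, float("-inf")
    for heading in range(0, 360, step_deg):
        crop_vec = F.normalize(embed(street_view_at(heading)), dim=-1)
        score = (user_vec * crop_vec).sum().item()  # cosine similarity
        if score > best_score:
            best_heading, best_score = heading, score
    return best_heading
```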
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
On the Adversarial Robustness of Multi-Modal Foundation Models
Schlarmann, Christian, Hein, Matthias
Multi-modal foundation models combining vision and language, such as Flamingo or GPT-4, have recently gained enormous interest. Alignment of foundation models is used to prevent them from providing toxic or harmful output. While malicious users have successfully tried to jailbreak foundation models, an equally important question is whether honest users could be harmed by malicious third-party content. In this paper we show that imperceptible attacks on images that change the caption output of a multi-modal foundation model can be used by malicious content providers to harm honest users, e.g., by guiding them to malicious websites or broadcasting fake information. This indicates that countermeasures against adversarial attacks should be employed by any deployed multi-modal foundation model.
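The kind of attack described here can be approximated by a standard targeted PGD loop that makes a chosen caption likely under an imperceptibly small image perturbation. In the sketch below, `model_loss` is an assumed differentiable interface (teacher-forced negative log-likelihood of the target caption), not any specific model's API.

```python
import torch

def caption_attack(image, target_tokens, model_loss,
                   eps=4/255, step=1/255, iters=200):
    """Craft an L-inf-bounded perturbation steering the caption toward a target."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = model_loss(image + delta, target_tokens)  # NLL of target caption
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()  # descend: make target caption likely
            delta.clamp_(-eps, eps)            # perturbation stays imperceptible
            delta.grad.zero_()
    return (image + delta).clamp(0, 1)
```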
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- North America > United States > New York (0.04)
- Information Technology > Security & Privacy (0.67)
- Government > Military (0.49)
- Health & Medicine > Therapeutic Area (0.48)
- Leisure & Entertainment > Sports > Tennis (0.46)
MEWL: Few-shot multimodal word learning with referential uncertainty
Jiang, Guangyuan, Xu, Manjie, Xin, Shiji, Liang, Wei, Peng, Yujia, Zhang, Chi, Zhu, Yixin
Without explicit feedback, humans can rapidly learn the meaning of words. Children can acquire a new word after just a few passive exposures, a process known as fast mapping. This word-learning capability is believed to be the most fundamental building block of multimodal understanding and reasoning. Despite recent advancements in multimodal learning, a systematic and rigorous evaluation of human-like word learning in machines is still missing. To fill this gap, we introduce the MachinE Word Learning (MEWL) benchmark to assess how machines learn word meaning in grounded visual scenes. MEWL covers humans' core cognitive toolkits in word learning: cross-situational reasoning, bootstrapping, and pragmatic learning. Specifically, MEWL is a few-shot benchmark suite consisting of nine tasks for probing various word-learning capabilities. These tasks are carefully designed to align with children's core abilities in word learning and to echo theories in the developmental literature. By evaluating multimodal and unimodal agents' performance with a comparative analysis against human performance, we observe a sharp divergence between human and machine word learning. We further discuss these differences and call for human-like few-shot word learning in machines.
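To make the benchmark's few-shot structure concrete, the sketch below shows one plausible episode format and accuracy loop; the field names and the `agent` interface are ours, not MEWL's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    context_scenes: List[str]   # rendered scenes shown as passive exposures
    context_words: List[str]    # novel word uttered alongside each scene
    query_scene: str            # held-out scene to be named
    candidates: List[str]       # candidate words for the query
    answer: str                 # the correct word

def accuracy(episodes, agent):
    hits = 0
    for ep in episodes:
        # The agent must infer word meanings cross-situationally from the
        # context scene-word pairs, then name the query scene.
        pred = agent(ep.context_scenes, ep.context_words,
                     ep.query_scene, ep.candidates)
        hits += int(pred == ep.answer)
    return hits / len(episodes)
```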
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
- Education (1.00)
- Health & Medicine > Therapeutic Area (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Efficient Zero-shot Visual Search via Target and Context-aware Transformer
Ding, Zhiwei, Ren, Xuezhe, David, Erwan, Vo, Melissa, Kreiman, Gabriel, Zhang, Mengmi
Visual search is a ubiquitous challenge in natural vision, including daily tasks such as finding a friend in a crowd or searching for a car in a parking lot. Humans rely heavily on relevant target features to perform goal-directed visual search. Meanwhile, context is critically important for locating a target object in complex scenes, as it helps narrow down the search area and makes the search process more efficient. However, few works have combined both target and context information in visual search computational models. Here we propose a zero-shot deep learning architecture, TCT (Target and Context-aware Transformer), that modulates self-attention in a Vision Transformer with target- and context-relevant information to enable human-like zero-shot visual search performance. Target modulation is computed as patch-wise local relevance between the target and search images, whereas contextual modulation is applied in a global fashion. We conduct visual search experiments with TCT and other competitive visual search models on three natural-scene datasets with varying levels of difficulty. TCT demonstrates human-like performance in terms of search efficiency and beats SOTA models on challenging visual search tasks. Importantly, TCT generalizes well across datasets with novel objects without retraining or fine-tuning. Furthermore, we introduce a new dataset to benchmark models for invariant visual search under incongruent contexts. TCT manages to search flexibly via target and context modulation, even under incongruent contexts.
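A minimal sketch of the target-and-context modulation, as we read the abstract: patch-wise similarity to the target supplies a local cue, a global context prior biases it, and a softmax yields a priority map over search locations. This is illustrative, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def modulated_priority(search_patches, target_patches, context_prior):
    """search_patches: (N, D) patch embeddings of the search image.
    target_patches: (M, D) patch embeddings of the target exemplar.
    context_prior: (N,) global contextual weighting over search locations."""
    sim = (F.normalize(search_patches, dim=-1)
           @ F.normalize(target_patches, dim=-1).T)   # (N, M) patch similarities
    target_relevance = sim.max(dim=-1).values         # best local match per patch
    logits = target_relevance + context_prior         # combine local + global cues
    return torch.softmax(logits, dim=0)               # priority map over patches
```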
- North America > United States > Texas > Harris County > Houston (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Google's DeepMind AI can 'transframe' a single image into a video
Earlier this week, the team at Google's DeepMind AI lab unveiled a new model dubbed Transframer, which allows AI to generate 30-second videos from a single image input. It's a nifty little trick at first glance, but the implications are much larger than an interesting .GIF file. As the team put it in its announcement, Transframer is "a general-purpose generative framework that can handle many image and video tasks in a probabilistic setting," and new work shows it "excels in video prediction and view synthesis, and can generate 30s videos from a single image": https://t.co/wX3nrrYEEa "Transframer is state-of-the-art on a variety of video generation benchmarks, and… can generate coherent 30 second videos from a single image without any explicit geometric information," the DeepMind research team explains.
Graph-Based Continual Learning
Tang, Binh, Matteson, David S.
Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.
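One way a learnable similarity graph over memory slots could both aid replay and guard against forgetting is sketched below; the regularizer and interfaces are our illustration of the idea, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMemory(nn.Module):
    """Episodic memory slots plus a learnable pairwise-similarity graph (a sketch)."""
    def __init__(self, num_slots, input_dim):
        super().__init__()
        self.register_buffer("slots", torch.zeros(num_slots, input_dim))
        self.adj = nn.Parameter(torch.rand(num_slots, num_slots))  # edge logits

    def replay_regularizer(self, encoder):
        feats = F.normalize(encoder(self.slots), dim=-1)
        sim = feats @ feats.T                # current pairwise similarities
        graph = torch.sigmoid(self.adj)      # learned edge weights in [0, 1]
        # Penalizing drift between the learned graph and the current
        # similarities discourages representations of stored samples from
        # moving — one way a graph can guard against forgetting.
        return F.mse_loss(sim, graph)
```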
- North America > United States > New York > Tompkins County > Ithaca (0.04)
- Europe > Hungary > Hajdú-Bihar County > Debrecen (0.04)
- Health & Medicine (0.51)
- Education (0.46)